This journal includes the code that is used to clean the training and testing datasets.
# Load libraries
library(tidyverse)
library(plotly)
library(randomForest)
Both the training and the testing data will require the use of a vector of the names of the features used in the rtrees random forest included in the bulletr library. I create such a vector below to use in this journal.
# Obtain features used when fitting the rtrees random forest
rf_features <- rownames(bulletr::rtrees$importance)
This was originally a part of a journal entry that I wrote in my ‘Case Studies with LIME’ repository. I took the code that I used to clean the training dataset from that entry and updated it in this entry. It should still produce essentially the same dataset (with some possible changes to the level names of some of the barrels due to a ‘factor’ issue). The dataset that gets saved from this journal is the one that I am using for this research project.
The dataset loaded in below is the original Hamby 172 and 252 dataset that Heike gave to me. Note that when the hamby173and252 dataset is read in, the studies called “Cary” are excluded. The data file contains rows based on bullet scans from a different study. These rows are no longer being included since Heike has found the study they came from to be poorly executed.
# Load in the Hamby 173 and 252 dataset
hamby173and252_raw <- read.csv("../data/originals/features-hamby173and252.csv") %>%
filter(study1 != "Cary", study2 != "Cary") %>%
mutate(study1 = factor(study1),
study2 = factor(study2))
If we include symmetric comparisons, each set of test bullets should result in a dataset with \[(35 \mbox{ bullets} \times 6 \mbox{ lands})^2=44100 \mbox{ rows},\] where a row would contain information on a pair of lands. If we do not include the symmetric comparisons, then the dataset should have \[\frac{(44100 \mbox{ rows} - (35 \mbox{ bullets} \times 6))}{2} + (35 \mbox{ bullets} \times 6) = 22155 \mbox{ rows}.\] However, when I looked at the dimension of the datasets, neither of these seem to be the case. See the R code and output below. Note that hamby173 is currently incorrectly labelled as hamby44. Both test sets have less than but close to 22,155 rows. This suggests that these do not include symmetric comparisons. When I checked with Heike, she confirmed that this is the case. This table also shows that there are comparisons across hamby173 and hamby252. These missing observations will be explored more in the next section.
# Summary of the number of observations in the Hamby173and252 datase
table(hamby173and252_raw$study1, hamby173and252_raw$study2)
##
## Hamby252 Hamby44
## Hamby252 20910 16862
## Hamby44 25573 21321
The plot below considers the number of observations within a barrel and bullet comparison from all known cases in the Hamby 173 and 252 data. We can see that the observations on the lower diagonals are missing in all cases which confirms that the symmetric comparisons were not included in the data. Additionally, a handful of cases have less than 36 observations. For the comparisons within the Hamby 173 or Hamby 252 study, the cells on the diagonals are less than 36, because none of the repeats from the symmetric comparisons of lands are included. The cells above the diagonal with less than 36 observations are missing some observations due to tank rash. For the comparisons across studies, the cases with less than expected are also due to tank rash. For some reason, the comparisons between bullets 1 from Hamby 173 and Hamby 252, the cells are being colored grey even though they have 36 observations. I am not sure why this is…
# Create the plot to look at number of comparisons within the known bullets
countplot <- hamby173and252_raw %>%
filter(barrel1 %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"),
barrel2 %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")) %>%
group_by(study1, study2, barrel1, barrel2, bullet1, bullet2) %>%
summarise(count = n()) %>%
ggplot(aes(x = barrel1, y = barrel2)) +
geom_tile(aes(fill = count)) +
facet_grid(study1 + bullet1 ~ study2 + bullet2, scales = "free") +
theme_minimal() +
scale_fill_distiller(palette = "GnBu", direction = 1)
# Make the plot interactive
ggplotly(countplot, width = 800, height = 700) %>%
shiny::div(align = "center")
The code below cleans the training data. The cleaning involves:
# Determine the letters associated with the unknown bullets
letters <- levels(hamby173and252_raw$barrel1)[11:length(levels(hamby173and252_raw$barrel1))]
# Cleaning the testing data
hamby173and252_train <- hamby173and252_raw %>%
mutate(study1 = fct_recode(study1, "Hamby173" = "Hamby44"),
study2 = fct_recode(study2, "Hamby173" = "Hamby44"),
bullet1 = factor(ifelse(barrel1 %in% letters,
as.character(barrel1),
as.character(bullet1))),
barrel1 = factor(ifelse(barrel1 %in% letters,
as.character("Unknown"),
as.character(barrel1))),
bullet2 = factor(ifelse(barrel2 %in% letters,
as.character(barrel2),
as.character(bullet2))),
barrel2 = factor(ifelse(barrel2 %in% letters,
as.character("Unknown"),
as.character(barrel2))),
land1 = factor(land1),
land2 = factor(land2),
rfscore = predict(bulletr::rtrees, hamby173and252_raw %>%
select(rf_features),
type = "prob")[,2]) %>%
rename(samesource = match) %>%
select(study1, barrel1, bullet1, land1, study2, barrel2, bullet2, land2,
rf_features, samesource, rfscore)
The cleaned data is saved and used as the training dataset for the rest of this research project.
# Save the datasets and response variables as .csv files
write.csv(hamby173and252_train, "../data/hamby173and252_train.csv", row.names = FALSE)
The number of rtrees predictions does not match the length of my training data. I need to ask Heike about this.
dim(predict(bulletr::rtrees, type = "prob"))
## [1] 83028 2
dim(hamby173and252_train)
## [1] 84666 19
The original data files given to me by Heike are loaded in below. For now, we are working with only sets 1 and 11 from the Hamby 224 study. She may provide me with more in the future.
# Load in the Hamby 224 datasets
hamby224_set1 <- readRDS("../data/originals/h224-set1-features.rds")
hamby224_set11 <- readRDS("../data/originals/h224-set11-features.rds")
The code below cleans the data from both sets 1 and 11. This involves:
# Clean the Hamby 224 set 1 data
hamby224_set1_cleaned <- hamby224_set1 %>%
select(-bullet_score, -land1, -land2, -aligned, -striae, -features) %>%
rename(bullet1 = bulletA,
bullet2 = bulletB,
land1 = landA,
land2 = landB) %>%
mutate(study = factor("Hamby 224"),
set = factor("Set 1"),
bullet1 = recode(factor(bullet1),
"1" = "Known 1", "2" = "Known 2", "Q" = "Questioned"),
bullet2 = recode(factor(bullet2),
"1" = "Known 1", "2" = "Known 2", "Q" = "Questioned"),
land1 = recode(factor(land1),
"1" = "Land 1", "2" = "Land 2", "3" = "Land 3",
"4" = "Land 4", "5" = "Land 5", "6" = "Land 6"),
land2 = recode(factor(land2),
"1" = "Land 1", "2" = "Land 2", "3" = "Land 3",
"4" = "Land 4", "5" = "Land 5", "6" = "Land 6")) %>%
select(study, set, bullet1:land2, rf_features, rfscore, samesource)
# Clean the Hamby 224 set 11 data
hamby224_set11_cleaned <- hamby224_set11 %>%
select(-bullet_score, -land1, -land2, -aligned, -striae, -features) %>%
rename(bullet1 = bulletA,
bullet2 = bulletB,
land1 = landA,
land2 = landB) %>%
mutate(study = factor("Hamby 224"),
set = factor("Set 11"),
bullet1 = recode(factor(bullet1),
"Bullet 1" = "Known 1", "Bullet 2" = "Known 2",
"Bullet I" = "Questioned"),
bullet2 = recode(factor(bullet2),
"Bullet 1" = "Known 1", "Bullet 2" = "Known 2",
"Bullet I" = "Questioned")) %>%
select(study, set, bullet1:land2, rf_features, rfscore, samesource)
The cleaned data from sets 1 and 11 are combined below into the testing dataset. Rows are added for the missing comparisons from the Hamby 224 study, and some additional cleaning is done.
# Create a dataset with all combinations of lands and bullets comparisons for each set
combinations <- data.frame(set = factor(rep(c("Set 1", "Set 11"), each = 324)),
expand.grid(land1 = factor(c("Land 1", "Land 2", "Land 3",
"Land 4", "Land 5", "Land 6")),
land2 = factor(c("Land 1", "Land 2", "Land 3",
"Land 4", "Land 5", "Land 6")),
bullet1 = factor(c("Known 1", "Known 2", "Questioned")),
bullet2 = factor(c("Known 1", "Known 2", "Questioned"))))
# Join the two cleaned Hamby 224 sets into one testing set
hamby224_test <- suppressWarnings(bind_rows(hamby224_set1_cleaned,
hamby224_set11_cleaned)) %>%
mutate(set = factor(set),
bullet1 = factor(bullet1),
bullet2 = factor(bullet2),
land1 = factor(land1),
land2 = factor(land2)) %>%
right_join(combinations, by = c("set", "land1", "land2", "bullet1", "bullet2")) %>%
filter(!(bullet1 == "Questioned" & bullet2 == "Known 1"),
!(bullet1 == "Questioned" & bullet2 == "Known 2"),
!(bullet1 == "Known 2" & bullet2 == "Known 1")) %>%
arrange(rfscore) %>%
mutate(case = factor(1:length(study))) %>%
select(case, study:samesource)
The testing data file is saved below.
# Save the test data as a .csv file
write.csv(hamby224_test, "../data/hamby224_test.csv", row.names = FALSE)
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 randomForest_4.6-14 plotly_4.8.0.9000
## [4] forcats_0.3.0 stringr_1.3.1 dplyr_0.7.8
## [7] purrr_0.2.5 readr_1.1.1 tidyr_0.8.2
## [10] tibble_1.4.2 ggplot2_3.1.0 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.0 jsonlite_1.6
## [3] viridisLite_0.3.0 modelr_0.1.2
## [5] shiny_1.2.0 assertthat_0.2.0
## [7] TTR_0.23-3 cellranger_1.1.0
## [9] yaml_2.2.0 robustbase_0.93-2
## [11] pillar_1.3.0 backports_1.1.2
## [13] lattice_0.20-38 glue_1.3.0
## [15] digest_0.6.18 manipulateWidget_0.10.0
## [17] RColorBrewer_1.1-2 promises_1.0.1
## [19] rvest_0.3.2 colorspace_1.3-2
## [21] htmltools_0.3.6 httpuv_1.4.5.1
## [23] plyr_1.8.4 pkgconfig_2.0.2
## [25] broom_0.5.0 haven_1.1.2
## [27] xtable_1.8-3 scales_1.0.0
## [29] webshot_0.5.0 x3ptools_0.0.1
## [31] later_0.7.5 withr_2.1.2
## [33] lazyeval_0.2.1 cli_1.0.0
## [35] magrittr_1.5 crayon_1.3.4
## [37] readxl_1.1.0 mime_0.6
## [39] evaluate_0.11 nlme_3.1-137
## [41] xts_0.11-0 xml2_1.2.0
## [43] bulletr_0.1.0.9003 tools_3.5.2
## [45] data.table_1.11.8 hms_0.4.2
## [47] smoother_1.1 munsell_0.5.0
## [49] compiler_3.5.2 rlang_0.3.1
## [51] grid_3.5.2 rstudioapi_0.7
## [53] htmlwidgets_1.3 crosstalk_1.0.0
## [55] miniUI_0.1.1.1 labeling_0.3
## [57] rmarkdown_1.10 gtable_0.2.0
## [59] curl_3.2 reshape2_1.4.3
## [61] R6_2.3.0 zoo_1.8-3
## [63] lubridate_1.7.4 knitr_1.20
## [65] bindr_0.1.1 rprojroot_1.3-2
## [67] stringi_1.2.4 Rcpp_1.0.0
## [69] rgl_0.99.16 DEoptimR_1.0-8
## [71] tidyselect_0.2.5